Variant Discovery    ◾    121

in whole genome sequencing. However, several studies have shown that retaining PCR and

Illumina clustering duplicates does not cause significant artifacts as long as the library

complexity is sufficient. The duplicate alignments can be removed with “samtools rmdup”.

The following script creates a new subdirectory “dupRemBam” and then removes duplicate

alignments from the BAM files and stores the new BAM file in the new subdirectory:

mkdir dupRemBam

cd sortedbam

for i in $(ls *.bam);

do

samtools rmdup ${i} ../dupRemBam/${i} 2> ../dupRemBam/${i}.log

done;

cd ..

4.2.1.1.8  Varian calling with bcftools

It is time to use the bcftools for variant calling. The bcftools is a consensus variant caller as

discussed above. It takes a single or multiple BAM files as input and outputs a compressed

VCF file. The following script creates the subdirectory “variants” in the main project direc-

tory and then it uses “bcftools mpileup” for performing pileup and “bcftools call” to call

the variants for the piled-up reads:

mkdir variants

cd dupRemBam

ls *.bam | rev | rev > bam_list.txt

bcftools mpileup -Ou \

-f ../ref/GCF_009858895.2_ASM985889v3_genomic.fna \

-b bam_list.txt \

| bcftools call -vmO z \

-o ../variants/sarscov2.vcf.gz

cd ..

The VCF file that contains the variants (SNPs and InDels) will be saved in “variants” sub-

directory. For further analysis, the VCF file can be indexed using “tabix”, which is a generic

indexer for TAB-delimited genome position files.

tabix variants/sarscov2.vcf.gz

We can view the VCF file with “bcftool view” as follows:

bcftools view variants/sarscov2.vcf.gz | less -S

All steps from downloading the raw data to variant calling can be included in a single bash

file that can be executed on the command-line prompt of the Linux terminal. First, create

a new directory, make that directory the current working directory, and then create a file

with the run ids “ids.txt” as described above. Then, create a bash file “pipeline_bcftools.